Personal Loan Campaign Project 4

Brandy Murray

Description

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

You, as a data scientist at AllLife Bank, have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant.
  3. To determine which segment of customers should be targeted more.

Data Dictionary

  1. ID: Customer ID
  2. Age: Customer’s age in completed years
  3. Experience: #years of professional experience
  4. Income: Annual income of the customer (in thousand dollars)
  5. ZIP Code: Home Address ZIP code.
  6. Family: Family size of the customer
  7. CCAvg: Average spending on credit cards per month (in thousand dollars)
  8. Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  9. Mortgage: Value of house mortgage if any. (in thousand dollars)
  10. Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
  11. Securities_Account: Does the customer have a securities account with the bank?
  12. CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
  13. Online: Do customers use internet banking facilities?
  14. CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)?
  15. Longitude: A longitude is an angle from the prime meridian, measured to the east (longitudes to the west are negative)
  16. Latitude: Latitudes measure an angle up from the equator (latitudes to the south are negative)

Load the Libraries & Explore the Data

Import Libraries

Load and Explore Data

Looking at these three sections of rows, we can see that it is safe to drop the included ID column and work with the generated index instead, since the values are identical apart from an offset of one (Python indexing starts at 0).
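A minimal sketch of that drop, using a hypothetical three-row stand-in for the dataset (the real file would be loaded with `pd.read_csv`):

```python
import pandas as pd

# Toy stand-in for the loan dataset; values are hypothetical
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Income": [49, 34, 11],
})

# The ID column duplicates the auto-generated index (offset by 1), so drop it
df = df.drop(columns=["ID"])
```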

We can also see that there are several columns we will want to combine into groups to reduce the number of columns, such as Age, Experience, Income, and ZIPCode. I also added the Longitude and Latitude coordinates to see if we can gain any perspective from the locations of these cities.

Here we can see that all the column titles are in good shape, with no '-' or '.' characters in the column names. I initially imported ZIPCode as an object since ZIP codes carry no numerical meaning. Looking at both Family and Education, I believe these can be converted as well.
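The conversions described above might look like this (toy values; column names follow the data dictionary):

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "ZIPCode": [91107, 90089, 94720],
    "Family": [4, 3, 1],
    "Education": [1, 1, 2],
})

# ZIP codes are labels, not quantities, so store them as strings
df["ZIPCode"] = df["ZIPCode"].astype(str)

# Family and Education take a small fixed set of levels -> categorical
df["Family"] = df["Family"].astype("category")
df["Education"] = df["Education"].astype("category")
```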

Here we can see that there are no missing values, with every column having a count of 5,000. Further investigation will be needed to make sure each of those values is valid, though.

Age: The average age of people in the dataset is 45; ages range from 23 to 67 years.

Experience: The average amount of experience is 20 years. The -3 years showing in the min column will need further investigation.

Income: The average income is \$64,000, but there is a very large range in this column, with the min being \$8,000 and the max being \$224,000.

Here we can see that Experience has 4 counts of -3, 15 counts of -2, and 33 counts of -1. This is obviously an error, because you cannot have a negative amount of experience. I briefly wondered whether these values could somehow refer to a person still being in high school, but after looking at the ages no one is young enough. Therefore, I am going to assume the '-' is the error and remove it; these counts will then be added to the totals for 1, 2, and 3.

We can also see that under Mortgage, 3,462 of the 5,000 values are 0. From the wording of the description, I am assuming these people do not have a mortgage. We will have to see whether this high share (69%) means anything when it comes to whether or not these people purchase a loan.
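Stripping the sign as described could be done with `Series.abs()`; the values below are a hypothetical sample, not the real column:

```python
import pandas as pd

# Hypothetical Experience values including the erroneous negatives
df = pd.DataFrame({"Experience": [-3, -1, 5, 20, -2]})

# Treat the minus sign as a data-entry error: take the absolute value,
# folding the -1/-2/-3 counts into the totals for 1/2/3
df["Experience"] = df["Experience"].abs()
```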

So far the data in the columns is looking good. Before starting the analysis portion, though, I wanted to make some bins to help manage the columns that contain many distinct values.
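One way to build such bins is `pd.cut`; the cut points and labels below are hypothetical, not necessarily the ones used in the report:

```python
import pandas as pd

# Hypothetical Income values (in $000s)
df = pd.DataFrame({"Income": [8, 45, 98, 224]})

# Hypothetical bin edges and labels for illustration
bins = [0, 50, 100, 150, 250]
labels = ["low", "middle", "upper-middle", "high"]
df["Income_bin"] = pd.cut(df["Income"], bins=bins, labels=labels)
```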

3. Processing Columns

4. Perform an Exploratory Data Analysis

Univariate Analysis

Summary Statistics of Numeric and Non-Numeric Variables

Here we can see that there are some spikes at certain ages and experience levels; without these spikes, the distributions would be fairly flat. Income is right-skewed, which we saw before: more than 75% of incomes are below \$98,000, but the max spikes up to \$224,000. CCAvg is also right-skewed. Mortgage is displayed very oddly since there are 3,462 zeros. Personal Loan, Securities Account, CD Account, Online, and Credit Card are all boolean values. I did think the median household income was really interesting: it appears to be almost normally distributed with a slight right skew.

Log Transformation

I wanted to see what would happen if I took the log of Mortgage. The transformation did not change the way the distribution looked. I suspected this might happen, but wanted to try it and see.
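A sketch of the transformation, using `np.log1p` since `log(0)` is undefined for the many zero mortgages (toy values):

```python
import numpy as np
import pandas as pd

# Hypothetical Mortgage values (in $000s); most customers have 0
df = pd.DataFrame({"Mortgage": [0, 0, 100, 300]})

# log1p(x) = log(1 + x) maps the zeros to 0 instead of -inf
df["Mortgage_log"] = np.log1p(df["Mortgage"])
```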

Bivariate Analysis

Here we can see that Age and Experience are highly correlated. Income looks highly correlated with CCAvg and Mortgage, but when we check the numbers the correlations are only 0.65 and 0.21, respectively.

Looking at the correlations with Personal Loan shows there are no strongly correlated variables.
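The correlation check could be sketched like this, with a toy numeric frame standing in for the real data:

```python
import pandas as pd

# Hypothetical numeric sample; Age and Experience move together by design
df = pd.DataFrame({
    "Age": [25, 45, 39, 60],
    "Experience": [1, 19, 15, 35],
    "Personal_Loan": [0, 1, 0, 1],
})

# Pairwise correlations, then sort by association with the target
corr = df.corr()
loan_corr = corr["Personal_Loan"].drop("Personal_Loan").sort_values(ascending=False)
```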

5. Model Evaluation Criterion

Model can make wrong predictions as:

  1. Predicting a customer is not going to purchase a loan when in reality the customer would purchase one (a false negative) - loss of revenue from a missed loan.
  2. Predicting a customer is going to purchase a loan when in reality the customer will not (a false positive) - loss of resources spent on the campaign.

Which loss is greater?

How to reduce this loss?

Data Preparation
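A sketch of a typical preparation step here, assuming one-hot encoding of the categoricals plus a stratified train/test split (toy data; the report's actual encoding and split ratio may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical prepared sample
df = pd.DataFrame({
    "Income": [8, 45, 98, 224, 60, 130, 20, 75],
    "Education": ["1", "2", "3", "2", "1", "3", "1", "2"],
    "Personal_Loan": [0, 0, 1, 1, 0, 1, 0, 1],
})

# One-hot encode categoricals, dropping the first level to avoid redundancy
X = pd.get_dummies(df.drop(columns=["Personal_Loan"]), drop_first=True)
y = df["Personal_Loan"]

# Stratify on the target so both splits keep the same loan/no-loan ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
```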

Logistic Regression (with Sklearn library)

Checking performance on training set

Checking performance on test set
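The train-vs-test comparison might look like the following sketch, with `make_classification` standing in for the prepared loan features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared loan data
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare the two recalls to check for over- or underfitting
train_recall = recall_score(y_train, model.predict(X_train))
test_recall = recall_score(y_test, model.predict(X_test))
```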

Observations

Logistic Regression (with statsmodels library)

Observations

After running the VIF test the first time, I found that Age and Experience had very high scores. Therefore, I dropped Experience.

There were also high scores between the bins I created and the original columns, so first I am trying to drop the bin columns I created.

After that, there was a high score between City, County, and ZIPCode.

Observations

Additional Information on VIF

With the multicollinearity removed, I want to try the df2 dataframe with the sklearn Logistic Regression.

Observations

Now that none of the variables exhibit high multicollinearity, we know the values in the summary are reliable, and we can remove the insignificant features. To start, I am going to remove the 4 counties that had a p-value of 1.000.

Hmmmm, I have multicollinearity problems again.

Here we can see that Age has the highest score, over 5.

I feel like I need to start over with the original df dataframe and then try dropping other variables. I am not sure how I would do that; would I just create another X_train from df? I will try this if I have time, but I only have 2 hours left and need to do the decision tree.

Decision Tree

Split Data

Build Decision Tree Model

Scoring our Decision Tree

What does a bank want?

Which loss is greater ?

Since we want to identify the people who will purchase a loan, we should use recall as the model evaluation metric instead of accuracy.
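A small illustration of why recall is the better metric here: on an imbalanced sample, accuracy can look strong while half the actual buyers are missed (hypothetical labels):

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical predictions: most customers decline the loan
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # misses one of the two real buyers

accuracy = accuracy_score(y_true, y_pred)  # looks strong despite the miss
recall = recall_score(y_true, y_pred)      # shows half the buyers are missed
```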

Visualizing the Decision Tree

Reducing overfitting

Confusion Matrix - decision tree with depth restricted to 3

Visualizing the Decision Tree

Using GridSearch for Hyperparameter tuning of our tree model
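A sketch of the grid search, scoring on recall as argued above; the parameter grid and synthetic data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features
X, y = make_classification(n_samples=200, random_state=1)

# Hypothetical grid; the report's actual grid may differ
param_grid = {
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize for recall, not accuracy
    cv=5,
)
grid.fit(X, y)
best_tree = grid.best_estimator_
```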

Confusion Matrix - decision tree with tuned hyperparameters

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

https://online.stat.psu.edu/stat508/lesson/11/11.8/11.8.2
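The pruning-path lookup works roughly like this (synthetic data standing in for the training set):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, random_state=1)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)

# Effective alphas are non-decreasing; leaf impurity rises as more is pruned
ccp_alphas, impurities = path.ccp_alphas, path.impurities
```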

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
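The loop over effective alphas, with the trivial last tree dropped, might be sketched as (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, random_state=1)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Fit one tree per effective alpha
clfs = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
    for a in ccp_alphas
]

# The last alpha prunes everything down to the root, so drop it
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
```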

Accuracy vs alpha for training and testing sets

When ccp_alpha is set to zero, keeping the other default parameters of DecisionTreeClassifier, the tree overfits, leading to 100% training accuracy and 69% testing accuracy. As alpha increases, more of the tree is pruned, creating a decision tree that generalizes better.

Since accuracy isn't the right metric for our data, we want high recall.

Confusion Matrix - post-pruned decision tree

Visualizing the Decision Tree

Comparing all the decision tree models

Decision tree with post-pruning is giving the highest recall on test set.

Observations

We can use these models to predict whether or not a customer will buy a loan. In both models we could see that an undergraduate education and a family of 3 to 4 members, along with income, were the deciding factors in whether someone would purchase a loan. If I were going to evaluate this model further, I would use the map I created to pinpoint where these customers reside, and then look at the branches in those areas to run the campaign.

We were able to see that some counties had a low enough p-value to be significant. I would start by looking at those counties and seeing whether customers with an undergraduate education and 3 to 4 family members reside there. I would also want to rerun the logistic regression model before submitting a formal recommendation to the CEO/CFO; I believe I could build a stronger model by changing the variables I dropped.